Multimodal Speech Emotion Recognition Using Audio and Text
Speech emotion recognition is a challenging task, and extensive reliance has
been placed on models that use audio features in building well-performing
classifiers. In this paper, we propose a novel deep dual recurrent encoder
model that utilizes text data and audio signals simultaneously to obtain a
better understanding of speech data. As emotional dialogue is composed of sound
and spoken content, our model encodes the information from audio and text
sequences using dual recurrent neural networks (RNNs) and then combines the
information from these sources to predict the emotion class. This architecture
analyzes speech data from the signal level to the language level, and it thus
utilizes the information within the data more comprehensively than models that
focus on audio features. Extensive experiments are conducted to investigate the
efficacy and properties of the proposed model. Our proposed model outperforms
previous state-of-the-art methods in assigning data to one of four emotion
categories (i.e., angry, happy, sad and neutral) when the model is applied to
the IEMOCAP dataset, as reflected by accuracies ranging from 68.8% to 71.8%.
Comment: 7 pages, Accepted as a conference paper at IEEE SLT 201
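The dual-encoder idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the plain tanh RNN cells, the concatenation-based fusion, the random untrained weights, the toy dimensions, and all function and parameter names (`rnn_encode`, `predict_emotion`, `params`) are assumptions made only for illustration.

```python
import math
import random

# The four emotion classes named in the abstract.
EMOTIONS = ["angry", "happy", "sad", "neutral"]

def rand_matrix(rows, cols, rng):
    """Random weight matrix; stands in for trained parameters (assumption)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def rnn_encode(seq, w_in, w_rec):
    """Run a plain tanh RNN over a sequence and return the last hidden state."""
    h = [0.0] * len(w_rec)
    for x in seq:
        a = matvec(w_in, x)
        b = matvec(w_rec, h)
        h = [math.tanh(p + q) for p, q in zip(a, b)]
    return h

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict_emotion(audio_seq, text_seq, params):
    """Encode each modality with its own RNN, then fuse by concatenation."""
    h_audio = rnn_encode(audio_seq, params["a_in"], params["a_rec"])
    h_text = rnn_encode(text_seq, params["t_in"], params["t_rec"])
    fused = h_audio + h_text  # list concatenation of the two encodings
    return softmax(matvec(params["out"], fused))

rng = random.Random(0)
AUDIO_DIM, TEXT_DIM, HIDDEN = 5, 4, 3  # toy sizes, chosen arbitrarily
params = {
    "a_in": rand_matrix(HIDDEN, AUDIO_DIM, rng),
    "a_rec": rand_matrix(HIDDEN, HIDDEN, rng),
    "t_in": rand_matrix(HIDDEN, TEXT_DIM, rng),
    "t_rec": rand_matrix(HIDDEN, HIDDEN, rng),
    "out": rand_matrix(len(EMOTIONS), 2 * HIDDEN, rng),
}
# Toy inputs: 7 audio frames and 3 word vectors for one utterance.
audio_seq = [[rng.uniform(-1, 1) for _ in range(AUDIO_DIM)] for _ in range(7)]
text_seq = [[rng.uniform(-1, 1) for _ in range(TEXT_DIM)] for _ in range(3)]
probs = predict_emotion(audio_seq, text_seq, params)
for name, p in zip(EMOTIONS, probs):
    print(f"{name}: {p:.3f}")
```

Because the weights are random, the printed probabilities carry no meaning; the sketch only shows how two separately encoded sequences of different lengths can feed a single four-way classifier.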
Speech Emotion Recognition Using Multi-hop Attention Mechanism
In this paper, we are interested in exploiting textual and acoustic data of
an utterance for the speech emotion classification task. The baseline approach
models the information from audio and text independently using two deep neural
networks (DNNs). The outputs from both the DNNs are then fused for
classification. As opposed to using knowledge from both the modalities
separately, we propose a framework to exploit acoustic information in tandem
with lexical data. The proposed framework uses two bi-directional long
short-term memory (BLSTM) networks to obtain hidden representations of the
utterance. Furthermore, we propose an attention mechanism, referred to as the
multi-hop, which is trained to automatically infer the correlation between the
modalities. The multi-hop attention first computes the relevant segments of the
textual data corresponding to the audio signal. The relevant textual data is
then applied to attend to parts of the audio signal. To evaluate the performance
of the proposed system, experiments are performed on the IEMOCAP dataset.
Experimental results show that the proposed technique outperforms the
state-of-the-art system by 6.5% relative improvement in terms of weighted
accuracy.
Comment: 5 pages, Accepted as a conference paper at ICASSP 2019 (oral
presentation)
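The two attention hops described in this abstract can be sketched roughly as follows. This is a hedged illustration, not the paper's formulation: it assumes dot-product scoring, uses the last audio hidden state as the first query, assumes both modalities share one hidden size, and replaces the BLSTM outputs with hand-written toy vectors. The names `attend` and `multi_hop` are hypothetical.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(query, states):
    """Dot-product attention: return weights over states and their weighted sum."""
    weights = softmax([dot(query, s) for s in states])
    dim = len(states[0])
    context = [sum(w * s[i] for w, s in zip(weights, states)) for i in range(dim)]
    return weights, context

def multi_hop(audio_states, text_states):
    """Hop 1: audio attends over text. Hop 2: attended text attends over audio."""
    audio_summary = audio_states[-1]  # assumed query for the first hop
    text_w, text_ctx = attend(audio_summary, text_states)
    audio_w, audio_ctx = attend(text_ctx, audio_states)
    return text_w, text_ctx, audio_w, audio_ctx

# Toy hidden states standing in for BLSTM outputs (made-up values).
audio_states = [[0.1, 0.3, -0.2], [0.4, -0.1, 0.2], [0.5, 0.2, 0.0], [0.9, 0.1, -0.3]]
text_states = [[0.8, 0.0, -0.1], [-0.5, 0.4, 0.3], [0.2, 0.2, 0.2]]

text_w, text_ctx, audio_w, audio_ctx = multi_hop(audio_states, text_states)
print("text attention weights: ", [round(w, 3) for w in text_w])
print("audio attention weights:", [round(w, 3) for w in audio_w])
```

In a full model, the two context vectors would typically be combined and passed to a classifier; how that combination is done here is left open, since the abstract does not specify it.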
Neural networks for compressing and classifying speaker-independent paralinguistic signals
Recognizing and classifying paralinguistic signals, with its various applications, is an important problem. In general, this task is considered challenging because the sound information from the signals is difficult to distinguish even by humans. Thus, analyzing signals with machine learning techniques is a reasonable approach to understanding signals. Audio features extracted from paralinguistic signals usually consist of high-dimensional vectors such as prosody, energy, cepstrum, and other speech-related information. Therefore, when the size of a training corpus is not sufficiently large, it is extremely difficult to apply machine learning methods to analyze these signals due to their high feature dimensions. This paper addresses these limitations by using neural networks' feature learning abilities. First, we use a neural network-based autoencoder to compress the signal to eliminate redundancy within the signal feature, and we show that the compressed signal features are competitive in distinguishing the signal compared to the original features. Second, we show by experiment that the neural network-based classification model almost always outperforms nonneural methods such as logistic regression, support vector machines, decision trees, and boosted trees.
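The compression step described in this abstract can be illustrated with a very small autoencoder. The sketch below is an assumption-laden toy, not the paper's model: it uses a linear encoder and decoder, plain stochastic gradient descent on squared reconstruction error, and synthetic 6-dimensional vectors generated from 2 hidden factors in place of real audio features. All names (`train_autoencoder`, `mean_error`) and hyperparameters are hypothetical.

```python
import random

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def train_autoencoder(data, code_dim, epochs=300, lr=0.01, seed=0):
    """Fit a linear encoder/decoder pair by SGD on squared reconstruction error."""
    rng = random.Random(seed)
    d = len(data[0])
    enc = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(code_dim)]
    dec = [[rng.uniform(-0.1, 0.1) for _ in range(code_dim)] for _ in range(d)]
    for _ in range(epochs):
        for x in data:
            h = matvec(enc, x)  # compressed code
            err = [r - t for r, t in zip(matvec(dec, h), x)]
            # Gradient w.r.t. the code, computed before the decoder is updated.
            grad_h = [sum(dec[i][j] * err[i] for i in range(d)) for j in range(code_dim)]
            for i in range(d):
                for j in range(code_dim):
                    dec[i][j] -= lr * err[i] * h[j]
            for j in range(code_dim):
                for i in range(d):
                    enc[j][i] -= lr * grad_h[j] * x[i]
    return enc, dec

def mean_error(data, enc, dec):
    """Average squared reconstruction error over the data set."""
    total = 0.0
    for x in data:
        recon = matvec(dec, matvec(enc, x))
        total += sum((r - t) ** 2 for r, t in zip(recon, x))
    return total / len(data)

# Synthetic redundant features: 6 dimensions driven by 2 hidden factors.
rng = random.Random(1)
mix = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(6)]
data = [matvec(mix, [rng.uniform(-1, 1), rng.uniform(-1, 1)]) for _ in range(20)]

enc0, dec0 = train_autoencoder(data, code_dim=2, epochs=0)  # untrained baseline
enc, dec = train_autoencoder(data, code_dim=2)
err_before = mean_error(data, enc0, dec0)
err_after = mean_error(data, enc, dec)
code = matvec(enc, data[0])  # 2-dimensional compressed feature
print(f"error before: {err_before:.4f}, after: {err_after:.4f}")
```

The compressed `code` vectors, rather than the original features, would then be the input to a downstream classifier; that stage is omitted here.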